
Aerospace Control and Application ›› 2024, Vol. 50 ›› Issue (3): 42-51. doi: 10.3969/j.issn.1674-1579.2024.03.005

• Papers and Reports •

Autonomous Mission Planning of Collaborative Observation for Moving Targets Based on Reinforcement Learning

  

  1. Beijing Institute of Control Engineering
  • Online: 2024-06-25  Published: 2024-09-27
  • Supported by:
    National Natural Science Foundation of China (U21B600005 and 62303048)

Abstract: With the growing number of space targets and their increasing dynamics, observing and positioning these targets is becoming ever more important for space security. Because many highly dynamic targets must be observed simultaneously while constellation observation resources are limited, the collaborative observation scheme has to be adjusted dynamically so that the constellation resources are used efficiently and every target is positioned with good accuracy. This requires solving the mission planning problem of observing multiple targets with multiple observation satellites. This paper first establishes the constellation attitude and orbit model, the flight dynamics model of the targets, and the Kalman filter model of the collaborative positioning algorithm that fuses the line-of-sight measurements from different observation satellites. A positioning error prediction model and a target observation priority model, both based on the geometric dilution of precision (GDOP), are then proposed. On top of these models, a mission planning framework for collaborative observation based on reinforcement learning (RL) is developed: a policy network built on a multi-head self-attention mechanism computes the planning results, and the proximal policy optimization (PPO) algorithm is used to train the policy network in a training environment. Simulation results show that, compared with a heuristic method based on tracking priority, the proposed RL method improves both the overall tracking accuracy and the effective tracking time of all targets, and it computes faster than a genetic algorithm.

Key words: multiple targets, collaborative observation, mission planning, reinforcement learning, self-attention mechanism, proximal policy optimization
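
As a pointer to how the GDOP-based error predictor mentioned in the abstract can be evaluated, here is a minimal NumPy sketch using the standard definition GDOP = sqrt(trace((H^T H)^(-1))). Stacking unit line-of-sight vectors as the rows of H is a common simplification for angle-only multi-satellite geometry; the paper's exact construction of H is not given on this page, so treat this as an assumption.

import numpy as np

def gdop(los_units: np.ndarray) -> float:
    """Geometric dilution of precision for a set of unit line-of-sight
    vectors, one row per observing satellite, shape (n, 3).

    GDOP = sqrt(trace((H^T H)^{-1})). Larger values mean worse
    observation geometry, i.e. larger expected positioning error
    for the same measurement noise.
    """
    H = np.asarray(los_units, dtype=float)
    M = H.T @ H                    # 3x3 geometry information matrix
    return float(np.sqrt(np.trace(np.linalg.inv(M))))

# Three orthogonal lines of sight -> ideal geometry, small GDOP;
# three nearly parallel lines of sight -> poor geometry, large GDOP.
good = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
poor = np.array([[1.0,    0.0,  0.0],
                 [0.999,  0.04, 0.01],
                 [0.999, -0.03, 0.02]])
poor /= np.linalg.norm(poor, axis=1, keepdims=True)   # normalize rows
print(gdop(good))   # ~1.73 (sqrt of 3), ideal orthogonal geometry
print(gdop(poor))   # much larger: near-collinear sightlines

In a planner like the one described above, a quantity of this kind can serve as the per-target score that the priority model and the reward signal are built from.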

CLC number: 

  • V44
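
Since this page carries only the abstract, the following is merely a minimal PyTorch sketch of the two ingredients it names: a policy network built on multi-head self-attention and the clipped PPO surrogate objective used to train it. Every concrete choice here (feature sizes, a pick-one-target-per-state action space, names such as AttentionPolicy and ppo_clip_loss) is an illustrative assumption, not the paper's architecture.

import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    """Toy policy: multi-head self-attention over per-target feature
    tokens, then a linear head that scores each target."""
    def __init__(self, feat_dim=8, embed_dim=32, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, targets):
        # targets: (batch, n_targets, feat_dim) -> logits: (batch, n_targets)
        x = self.embed(targets)
        x, _ = self.attn(x, x, x)        # self-attention: query = key = value
        return self.score(x).squeeze(-1)

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # Clipped surrogate objective of PPO, written as a loss to minimize.
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

# One illustrative update step on random data (a real training loop would
# alternate rollout collection with several epochs of minibatch updates).
policy = AttentionPolicy()
obs = torch.randn(16, 5, 8)              # 16 states, 5 targets, 8 features
with torch.no_grad():
    old_dist = torch.distributions.Categorical(logits=policy(obs))
    actions = old_dist.sample()          # choose one target per state
    logp_old = old_dist.log_prob(actions)
adv = torch.randn(16)                    # advantages would come from a critic
new_dist = torch.distributions.Categorical(logits=policy(obs))
loss = ppo_clip_loss(new_dist.log_prob(actions), logp_old, adv)
loss.backward()

The clipping term keeps each policy update close to the policy that collected the data, which is the usual reason PPO is chosen for training planning policies of this kind.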